{ "cells": [ { "cell_type": "markdown", "id": "understood-threat", "metadata": {}, "source": [ "# Unterscheide Gedichte von Spam - Text Vectorization" ] }, { "cell_type": "code", "execution_count": null, "id": "7eb3bba1-7943-4028-9db1-abf81a983477", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import sklearn.linear_model\n", "import sklearn.pipeline\n", "import sklearn.feature_extraction" ] }, { "cell_type": "markdown", "id": "cac9fb7f-5093-4a3f-9f72-c5d7a33e0fe0", "metadata": {}, "source": [ "Lade DataFrame vom fremden Jupyter Notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "respective-broadcast", "metadata": { "deletable": false }, "outputs": [], "source": [ "%%capture\n", "%run \"01 Unterscheide Gedichte von Spam - naiver Ansatz.ipynb\"\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "3a6d81c1-9070-40b1-86b2-0cbcd12d0c24", "metadata": {}, "source": [ "Teile Datensatz auf." ] }, { "cell_type": "code", "execution_count": null, "id": "306f8d51-ea98-4d16-a60d-ac33b8a290e7", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n", " df[\"text\"].values, df[\"category\"].values, test_size=0.33, random_state=42\n", ")" ] }, { "cell_type": "markdown", "id": "b69221ff-abeb-4530-80e8-91a04b084784", "metadata": {}, "source": [ "Definiere Lern-Pipeline.\n", "Mehr Infos hierzu unter\n", "https://scikit-learn.org/1.5/datasets/real_world.html#the-20-newsgroups-text-dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "0fcc341a-6db2-40fd-8840-bb43e4eff288", "metadata": {}, "outputs": [], "source": [ "clf = sklearn.linear_model.LogisticRegression() # Definiere Klassifizierer\n", "\n", "pipeline = sklearn.pipeline.Pipeline([\n", " ('vect', sklearn.feature_extraction.text.CountVectorizer()), # Zähle Häufigkeit von Wörtern\n", " # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()), # Verwende Logarithmus statt absolute Werte\n", " ('clf', clf),\n", "])" ] }, { "cell_type": "markdown", "id": "d9b47d9d-f301-4c8d-af2d-1afd70960945", "metadata": {}, "source": [ "Was macht der CountVector?\n", "Hier eine genauere Betrachung:" ] }, { "cell_type": "code", "execution_count": null, "id": "63a1ea3f-9ccb-4fe9-8f6c-b186a759edc4", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "vectorizer = sklearn.feature_extraction.text.CountVectorizer()\n", "X = vectorizer.fit_transform(X_train)\n", "pd.DataFrame(data={\n", " \"words\": vectorizer.get_feature_names_out(),\n", " **{\n", " f\"counts_entry_{i}\": X.toarray()[i]\n", " for i in range(len(X.toarray()))\n", " }\n", "}).set_index(\"words\").T" ] }, { "cell_type": "markdown", "id": "423cb509-08b0-4928-b449-8aec9c3fe536", "metadata": {}, "source": [ "Zur Erläuterung: `counts_entry_` steht für den i-ten Text im Datensatz." ] }, { "cell_type": "markdown", "id": "286d8c16-befd-427f-b9ad-e841a83577db", "metadata": {}, "source": [ "Überprüfen Sie, was der `sklearn.feature_extraction.text.TfidfTransformer` bringen würde.\n", "Informationen dazu finden Sie unter https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html." ] }, { "cell_type": "markdown", "id": "c3b1798c-610d-43dc-b7cf-7382b05fe59b", "metadata": {}, "source": [ "Aufgabe 1\n", "\n", "Sollte die Zeile `# ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()), # Verwende Logarithmus statt absolute Werte` weiter oben auskommentiert werden?" ] }, { "cell_type": "markdown", "id": "3578f1e6-a05b-43fe-ab65-63dac325934c", "metadata": {}, "source": [ "*Ihre Antwort:* ..." ] }, { "cell_type": "markdown", "id": "84587c22-e5a5-4ae7-8ffe-a3452b4da9dc", "metadata": {}, "source": [ "Trainiere den Klassifizierer `clf`. \n", "Die Rückgabe von `pipeline.score` entspricht der Rückgabe der Methode `score` des Klassifizierers `clf`." ] }, { "cell_type": "code", "execution_count": null, "id": "systematic-symposium", "metadata": {}, "outputs": [], "source": [ "pipeline.fit(X_train, y_train)\n", "pipeline.score(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "id": "599a9211-cc5c-4b59-bfdb-d9a302d0831c", "metadata": {}, "outputs": [], "source": [ "pipeline.score(X_test, y_test)" ] }, { "cell_type": "markdown", "id": "8e1b3ac5-1d4b-44db-805e-d0bd113c2a05", "metadata": {}, "source": [ "\"Creative     Dieses Werk von Marvin Kastner ist lizenziert unter einer Creative Commons Namensnennung 4.0 International Lizenz." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" }, "nbsphinx": { "execute": "always" } }, "nbformat": 4, "nbformat_minor": 5 }